Day 3 [Python ML] Selecting Data for Modeling (DecisionTree)

2021 iThome Ironman Contest

DAY 3

AI & Data

Learning Machine Learning with Python, part 3 of the series

13th Ironman Contest

guancioul

Team: 人工逗點智慧

2021-09-16 19:47:21

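Introduction

Picking up from yesterday's data loading, we start by reading the data with pd.read_csv,
then use the DataFrame's columns attribute to see which columns it contains.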

import pandas as pd

melbourne_file_path = './Dataset/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

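Dropping missing values

This dataset contains some missing values. Later in the series we will learn how to handle them properly.
For now we take the simplest approach: if a row contains any missing value, drop that row.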

# Drop every row that contains a missing value
melbourne_data = melbourne_data.dropna(axis=0)

After dropping the rows with missing values, the number of rows falls from 13580 to 6196.
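To check a drop like this yourself, it helps to compare the row count before and after dropna. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from melb_data.csv) and assumes only pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; all values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "YearBuilt": [1900.0, 1900.0, np.nan, 1910.0],
    "Price": [500000.0, 750000.0, 900000.0, 650000.0],
})

rows_before = toy.shape[0]

# Count missing values per column to see where the gaps are.
missing_per_column = toy.isnull().sum()

# axis=0 drops rows (not columns) that contain at least one missing value.
toy_clean = toy.dropna(axis=0)
rows_after = toy_clean.shape[0]

print(missing_per_column)
print(f"rows before: {rows_before}, rows after: {rows_after}")
```

The same two shape checks can be run on melbourne_data before and after the dropna call to confirm the counts.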

Choosing the prediction target

# Store the Price column in the variable y
y = melbourne_data.Price
# Choose the features to use
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
# Store the selected features in the variable X
X = melbourne_data[melbourne_features]
# The features are numeric; describe() summarizes them as count, mean, std, min, ...
X.describe()

# head() shows the first few rows of the data
X.head()
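The two selections above return different types: picking a single column gives a Series, while indexing with a list of column names gives a DataFrame. A minimal sketch on a made-up three-row DataFrame (the names toy, y_toy and X_toy are illustrative):

```python
import pandas as pd

# Toy data; all values are made up for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [500000.0, 750000.0, 900000.0],
})

# A single column is a one-dimensional Series.
# toy["Price"] is equivalent to toy.Price for a simple column name like this.
y_toy = toy["Price"]

# A list of column names returns a two-dimensional DataFrame.
feature_names = ["Rooms", "Bathroom"]
X_toy = toy[feature_names]

print(type(y_toy).__name__, y_toy.shape)
print(type(X_toy).__name__, X_toy.shape)
```

Bracket indexing also works for column names that contain spaces or clash with a DataFrame method, where dot access does not.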

Building the model

# Import the decision tree regression model
from sklearn.tree import DecisionTreeRegressor

# Define the model; setting random_state makes every run produce the same result
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model
melbourne_model.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

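Many ML models allow some randomness during training. Passing a number as random_state ensures that you get the same result on every run. Any number will do: model quality does not depend in any meaningful way on which value you choose.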
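One way to see the effect of random_state is to fit the same model twice with the same seed and compare the predictions. The sketch below runs on made-up random data, and sets max_features=2 only so that the tree actually has a random choice to make (it then considers a random subset of features at each split). These settings are assumptions for the demonstration, not the configuration used in this post.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: 200 rows, 5 numeric features, a noisy target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(size=200)


def fit_and_predict(seed):
    """Fit a small tree with the given seed and predict the first 10 rows."""
    model = DecisionTreeRegressor(max_features=2, max_depth=3, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_demo[:10])


# Same seed, same data: the two runs should match exactly.
same_seed_a = fit_and_predict(1)
same_seed_b = fit_and_predict(1)
print(np.array_equal(same_seed_a, same_seed_b))
```

A different seed may or may not produce a different tree. The point is only that a fixed seed makes the run repeatable.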

# Print the results
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
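Note that the five houses predicted above are rows the model was fitted on, so these predictions say little about how the model would do on houses it has not seen. As a rough illustration on made-up random data (not the Melbourne dataset), a decision tree with no depth limit typically reproduces its own training targets:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: continuous features, so every row is distinct.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(50, 3))
y_toy = rng.uniform(low=100.0, high=200.0, size=50)

# Default settings: the tree keeps splitting until each leaf is pure.
model = DecisionTreeRegressor(random_state=1)
model.fit(X_toy, y_toy)

# Predict on the same rows used for fitting and compare with the targets.
in_sample_predictions = model.predict(X_toy[:5])
print(in_sample_predictions)
print(y_toy[:5])
print(np.allclose(in_sample_predictions, y_toy[:5]))
```

Judging a model fairly requires data that was held out from fitting, which is a separate step from what is covered here.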